
feat(rust): Prune row groups before loading all columns #13746

Closed

Conversation

@bchalk101 (Contributor) commented Jan 15, 2024

This addresses #13608.

It handles the situation where the row group statistics don't help with filtering out row groups: only the columns required to apply the predicate are loaded first, row groups with no matching rows are filtered out, and then all the data is loaded from the remaining row groups.

The main downside of this implementation is that in the bad case, where the predicate matches every row and no row groups can be pruned, the predicate columns end up being downloaded twice. To get around this, a feature flag could be added, something like ROW_GROUP_PRUNING, so that this filtering is only applied when it is turned on.

I will note that on the datasets I am currently working on, as described in the issue, the filtering went from 25 minutes and 32 GB of memory consumption to 25 seconds and negligible memory.
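
To illustrate the kind of query this targets, here is a minimal sketch of the user-facing side: a lazy scan whose predicate only needs a small metadata column while the projection pulls a heavy column. The file path and column names are made up for illustration; this sketch is not part of the PR's diff.

```rust
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // Hypothetical columns: `small_meta` is cheap and only needed by the
    // predicate, `large_blob` is the heavy column we actually want.
    let df = LazyFrame::scan_parquet("data.parquet", ScanArgsParquet::default())?
        // With this change, only `small_meta` is downloaded first, and row
        // groups where no row matches are skipped entirely.
        .filter(col("small_meta").gt(lit(10)))
        // The remaining row groups are then downloaded in full for `large_blob`.
        .select([col("large_blob")])
        .collect()?;
    println!("{df}");
    Ok(())
}
```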

  • Is it safe? Handles all situations when there is no predicate
  • Feature flagged
  • Tested

@bchalk101 (Contributor, Author)

Hey @ritchie46,
Would be great to get your take on this enhancement.
Is this something you'd be OK merging in? Is it being implemented in the right direction?

Just want to check before putting more effort into this and implementing the feature toggle and testing.

@bchalk101 changed the title from "Prune row groups before loading all columns" to "feat(rust): Prune row groups before loading all columns" on Jan 16, 2024
@github-actions bot added the enhancement and rust labels on Jan 16, 2024
@bchalk101 force-pushed the optimize_rg_read branch 6 times, most recently from 75e7a1a to 872956f on January 21, 2024 at 08:57
@bchalk101 marked this pull request as ready for review on January 21, 2024 at 08:59
@bchalk101 force-pushed the optimize_rg_read branch 2 times, most recently from 057cdf0 to a8c771b on January 23, 2024 at 08:08
@bchalk101 closed this on Jan 23, 2024
@ritchie46 (Member)

Hi @bchalk101, why was this closed? Didn't it work?

@bchalk101 (Contributor, Author)

Hey @ritchie46,
No, it does work. I just wasn't sure if it would get merged and wanted to test out some other optimizations, so I closed it. But I'm happy to re-open it.

@bchalk101 reopened this on Jan 24, 2024
@bchalk101 force-pushed the optimize_rg_read branch 3 times, most recently from cace8b0 to 480a13f on January 29, 2024 at 13:03
@mkleinbort

Hi. Very interested in this feature - it'd be amazing for some very wide tables.

@ritchie46 (Member)

Yes, it is interesting, but I want to think about this a bit more, as the most common case would slow down. We lose an embarrassingly parallel load to a sequential one, with a very low probability of it being faster.

@bchalk101 (Contributor, Author)

I want to give some further insight into our use case, which perhaps can help with the decision. We are using Parquet as the format for saving data for training ML models, which means that each row can be quite large even if some columns are small; for example, a column may hold large compressed numpy arrays or even jpeg images. There is still small metadata in each row, which is what we filter on before actually selecting the rows. While I think we would be a small percentage of users, there is definitely a large number of people using Parquet for ML training data.

I would be open to other ways to achieve this with Polars, potentially a secondary read library suited to such data. The issue is that it is very difficult to use Polars with such data without a bunch of changes. Other changes might include applying limits (i.e. .limit) on the pruned row groups before collecting more data (which again would slow down the general use case), implementing a generator for loading data, or allowing the Parquet row group size to be set when sinking data into files.

The other option, as I mentioned in the description, is to use a feature flag, but of course this can lead to "flag bloat".
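
As a rough sketch of that last point (not something this PR implements), writing the training data with smaller row groups would give a predicate-first read more units it can skip. The column names and row group size are made up, and this assumes the `ParquetWriter::with_row_group_size` option in the Rust crate:

```rust
use std::fs::File;

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // Hypothetical ML training data: cheap metadata plus a heavy payload column.
    let mut df = df!(
        "small_meta" => &[1i64, 2, 3],
        "large_blob" => &["payload_a", "payload_b", "payload_c"]
    )?;

    // Smaller row groups mean more of the file can be skipped after pruning.
    let file = File::create("training_data.parquet")?;
    ParquetWriter::new(file)
        .with_row_group_size(Some(10_000))
        .finish(&mut df)?;
    Ok(())
}
```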


codspeed-hq bot commented Apr 11, 2024

CodSpeed Performance Report

Merging #13746 will not alter performance

Comparing bchalk101:optimize_rg_read (c77cf72) with main (11fe9d8)

Summary

✅ 34 untouched benchmarks

@kszlim (Contributor) commented Sep 9, 2024

Is this the same optimization as Late Materialization?

@bchalk101 (Contributor, Author)

Not exactly - it's the same idea, but done at the I/O level.
The process is as follows:

  1. Download only the predicate columns.
  2. Apply the predicate to decide whether the rest of the required columns should be downloaded.
  3. Mark the row group as required and use it to apply the slice.
  4. Download the entire row group and apply the predicate (at this point late materialisation would help).

This specifically helps when I/O and memory are the blockers, i.e. wide tables with columns that contain heavy data.
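
Sketched in plain Rust, the control flow of those four steps looks roughly like this. The types and helper functions are hypothetical stand-ins for the reader internals, not the real polars-io API:

```rust
/// Hypothetical handle to a single Parquet row group (stand-in type).
struct RowGroup;

/// Hypothetical decoded column data (stand-in type).
type Columns = Vec<Vec<u8>>;

// Stand-ins for the real I/O and predicate machinery.
fn read_columns(_rg: &RowGroup, _cols: &[&str]) -> Columns {
    Vec::new()
}

fn rows_matching_predicate(_cols: &Columns) -> usize {
    0
}

/// Steps 1-2: read only the predicate columns and keep row groups that contain
/// at least one matching row. Steps 3-4: read all requested columns from the
/// surviving row groups; the row-level predicate is applied afterwards on the
/// decoded data (omitted here).
fn prune_then_read(
    row_groups: &[RowGroup],
    predicate_cols: &[&str],
    projection_cols: &[&str],
) -> Vec<Columns> {
    row_groups
        .iter()
        .filter(|rg| {
            // Pass 1: cheap read of the predicate columns only.
            let pred_data = read_columns(rg, predicate_cols);
            rows_matching_predicate(&pred_data) > 0
        })
        // Pass 2: full read of the row groups that survived pruning.
        .map(|rg| read_columns(rg, projection_cols))
        .collect()
}

fn main() {
    let row_groups = vec![RowGroup, RowGroup];
    let data = prune_then_read(&row_groups, &["small_meta"], &["small_meta", "large_blob"]);
    println!("read {} row groups after pruning", data.len());
}
```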

@ritchie46 (Member)

I will close this one as it isn't getting merged anymore. I can see us dynamically adapting to such a strategy in the new streaming engine, but for now it's out of scope.

Labels
enhancement (New feature or an improvement of an existing feature), rust (Related to Rust Polars)

6 participants